
Add JetBrains Mellum2 recipes (Thinking + Instruct)#503

Merged
esmeetu merged 2 commits into vllm-project:main from esmeetu:add-mellum2-12b-thinking
Jun 2, 2026

Conversation


@esmeetu esmeetu commented Jun 2, 2026

Adds vLLM recipes for JetBrains' Mellum2 family — the reasoning-augmented Thinking checkpoint and its direct-answer Instruct sibling (both 12B total / 2.5B active, 64 experts / 8 active, 131K context, bf16).

Shared details

  • Architecture: MoE (MellumForCausalLM), bf16, ~29 GB — fits on a single H200/H100/A100. single_node_tp defaults to TP=1.
  • vLLM version: nightly. MellumForCausalLM support merged in vllm-project/vllm#43992 on 2026-06-01, after the latest stable v0.22.0 (2026-05-29), so it is not yet in a tagged release. Both recipes set nightly_required: true.
  • Tool calling: --enable-auto-tool-choice --tool-call-parser hermes (both).
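With --enable-auto-tool-choice and --tool-call-parser hermes set on the server, a client can offer tools in an OpenAI-style Chat Completions request. The following is a minimal sketch of such a request body, not taken from the recipe; the run_tests tool and the build_tool_request helper are hypothetical names used only for illustration.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build a Chat Completions payload that offers one tool to the model."""
    # Hypothetical tool definition, for illustration only.
    run_tests_tool = {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the project's test suite and return failures.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Test file or directory to run.",
                    }
                },
                "required": ["path"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [run_tests_tool],
        # "auto" lets the model decide whether to call the tool.
        "tool_choice": "auto",
    }


payload = build_tool_request(
    "JetBrains/Mellum2-12B-A2.5B-Thinking",
    "Why does tests/test_io.py fail?",
)
print(json.dumps(payload, indent=2))
```

The payload can be sent to the server's /v1/chat/completions endpoint; whether the model actually returns a tool call depends on the prompt and the checkpoint.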

Thinking vs Instruct

  • Thinking emits <think>...</think> chains before the answer → adds --reasoning-parser qwen3. Suited to complex debugging, planning, agentic/math-heavy tasks.
  • Instruct answers directly (no externalized CoT) → no reasoning parser. Lower-latency coding and tool use.
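To make the difference concrete, here is a rough sketch of the kind of separation a reasoning parser performs on Thinking output. It is a simplified illustration, not vLLM's qwen3 parser; split_reasoning and the sample string are made up for this example.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a raw completion string."""
    match = THINK_RE.search(text)
    if match is None:
        # Instruct-style output: no externalized chain of thought.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>1024 = 2**10, so it is a power of 2.</think>Yes, 1024 is 2 to the 10th power."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a reasoning parser enabled on the server, the client should not need this step, since the reasoning and the answer are expected to arrive in separate response fields.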

The two recipes cross-link via related_recipes.
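The flags described above can be assembled into a launch command as follows. This is a sketch under assumptions: build_serve_command is a hypothetical helper, the Instruct model ID is assumed by analogy with the Thinking one, and the recipe files may pass additional arguments not shown here.

```python
import shlex


def build_serve_command(model: str, thinking: bool, max_model_len: int = 131072) -> list[str]:
    """Assemble a `vllm serve` command from the flags named in this PR."""
    cmd = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--tensor-parallel-size", "1",  # single_node_tp defaults to TP=1
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
    ]
    if thinking:
        # Only the Thinking checkpoint emits <think>...</think> blocks.
        cmd += ["--reasoning-parser", "qwen3"]
    return cmd


thinking_cmd = build_serve_command("JetBrains/Mellum2-12B-A2.5B-Thinking", thinking=True)
# Assumed model ID; it is not spelled out in this PR.
instruct_cmd = build_serve_command("JetBrains/Mellum2-12B-A2.5B-Instruct", thinking=False)

print(shlex.join(thinking_cmd))
print(shlex.join(instruct_cmd))
```

Because MellumForCausalLM support is not yet in a tagged release, either command would need a nightly vLLM build.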

🤖 Generated with Claude Code

vercel Bot commented Jun 2, 2026

The latest updates on your projects.

Project: vllm-recipes | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Jun 2, 2026 1:03am

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a configuration file for the new Mellum2-12B-A2.5B-Thinking model by JetBrains and registers JetBrains as a provider. Feedback on the configuration suggests lowering the default --max-model-len from 131072 to 32768 to avoid potential Out-Of-Memory (OOM) errors on standard GPUs. Additionally, it is recommended to correct a likely typo in the Python client example, changing max_tokens from 81920 to 8192.

Comment on lines +21 to +23
  base_args:
    - "--max-model-len"
    - "131072"

high

Setting the default --max-model-len to the absolute maximum of 131072 in base_args will cause vLLM to attempt to allocate a massive KV cache on startup. For a model of this size (12B total parameters, ~24 GB in bf16), the KV cache for 131k tokens will require an additional ~8 GB+ of VRAM. On GPUs near the minimum VRAM requirement of 29 GB (or even 32 GB/40 GB GPUs), this will likely result in an Out-Of-Memory (OOM) error during initialization.

Consider setting a more conservative default (e.g., 32768 or 16384) in base_args to ensure the recipe runs out-of-the-box on standard GPUs, and document in the guide that users can scale it up to 131072 if they have higher-end hardware (like an A100 80GB or H100).

  base_args:
    - "--max-model-len"
    - "32768"
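The memory concern can be checked with the usual per-sequence KV-cache estimate for standard attention: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The sketch below applies it with placeholder architecture numbers; they are NOT Mellum2's real config values (the PR does not state them), so the printed sizes only illustrate how the cost scales with --max-model-len and should not be read as confirming or refuting the estimate above.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size for one sequence (bf16 = 2 bytes per element)."""
    # Factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Placeholder values, chosen only for illustration; read the real ones
# from the model's config.json before relying on the result.
NUM_LAYERS = 32
NUM_KV_HEADS = 4
HEAD_DIM = 128

GIB = 1024 ** 3
for max_len in (16384, 32768, 131072):
    size = kv_cache_bytes(max_len, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
    print(f"max_model_len={max_len}: {size / GIB:.1f} GiB for one full-length sequence")
```

The cost is linear in the context length, so dropping from 131072 to 32768 cuts the per-sequence requirement by a factor of four regardless of the actual architecture numbers.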

resp = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking",
    messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
    max_tokens=81920,

medium

The max_tokens parameter in the Python client usage example is set to 81920. This is extremely high for a single chat completion response and is likely a typo for 8192 (8k), which is the typical maximum generation length for reasoning models. Setting it excessively high can lead to client-side validation issues or unexpected behavior if the model gets stuck in a loop.

      max_tokens=8192,
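Whichever value is chosen, max_tokens has to share the context window with the prompt: the prompt tokens plus the requested completion tokens must fit within --max-model-len. A small sketch of that budget follows; max_prompt_tokens is a hypothetical helper, and the exact server-side check is assumed rather than quoted from vLLM.

```python
def max_prompt_tokens(max_model_len: int, max_tokens: int) -> int:
    """Return how many prompt tokens remain once max_tokens is reserved."""
    remaining = max_model_len - max_tokens
    if remaining <= 0:
        raise ValueError(
            f"max_tokens={max_tokens} leaves no room for a prompt "
            f"within max_model_len={max_model_len}"
        )
    return remaining


# Values from the PR: 131072 context with the original max_tokens.
print(max_prompt_tokens(131072, 81920))  # 49152
# Suggested values from the review comments.
print(max_prompt_tokens(32768, 8192))    # 24576
```

Note that the two review suggestions interact: if --max-model-len were lowered to 32768 while max_tokens stayed at 81920, the request could not fit at all.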

Direct-answer sibling of the Thinking checkpoint (no reasoning parser).
Cross-link the two via related_recipes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
@esmeetu esmeetu changed the title Add JetBrains/Mellum2-12B-A2.5B-Thinking recipe Add JetBrains Mellum2 recipes (Thinking + Instruct) Jun 2, 2026
@esmeetu esmeetu merged commit dbe9862 into vllm-project:main Jun 2, 2026
4 checks passed